The Failure of Exactitude
While a high-degree polynomial can pass through every data point, it often produces "Runge-like" oscillations: wild swings between the nodes that bear no resemblance to the underlying physical process. It is therefore unreasonable to require the approximating function to agree exactly with the data, especially when the measurements themselves are subject to error.
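As a rough illustration, the sketch below (assuming NumPy is available; the sample data, noise level, and seed are made up purely for demonstration) interpolates noisy samples of a simple underlying process exactly and compares the result with a low-degree fit between the sample points.

```python
import numpy as np

# Made-up noisy measurements of the simple underlying process y = x.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 11)
y = x + rng.normal(scale=0.05, size=x.size)

fine = np.linspace(0.0, 1.0, 201)

# Degree-10 polynomial: passes through every data point ...
interp = np.polynomial.Polynomial.fit(x, y, deg=10)
# ... degree-1 least squares line: only approximates them.
line = np.polynomial.Polynomial.fit(x, y, deg=1)

# Maximum deviation from the true process y = x on a fine grid; the
# interpolant typically deviates far more between the nodes,
# reflecting Runge-like oscillation.
print("interpolant:", np.abs(interp(fine) - fine).max())
print("linear fit: ", np.abs(line(fine) - fine).max())
```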
Defining the 'Best' Fit: The Three Norms
To approximate rather than interpolate, we must first define an error function $E$ to be minimized. How we measure "closeness" changes the result entirely:
The minimax ($E_\infty$) criterion seeks to minimize the maximum possible error:
$$E_{\infty}(a_0, a_1) = \max_{1 \le i \le n} \{|y_i - (a_1 x_i + a_0)|\}$$
Pitfall: The minimax approach generally assigns too much weight to a single data point that is badly in error.
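Evaluating this criterion for a candidate line is straightforward; the helper below is a minimal sketch (the function name and NumPy usage are my own, not from the text):

```python
import numpy as np

# Minimax error E_inf(a0, a1): the largest absolute residual of the
# candidate line y = a1*x + a0 over the data points (x_i, y_i).
def e_inf(a0, a1, x, y):
    return np.max(np.abs(np.asarray(y) - (a1 * np.asarray(x) + a0)))
```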
The absolute deviation ($E_1$) criterion sums the absolute differences:
$$E_1(a_0, a_1) = \sum_{i=1}^{n} |y_i - (a_1 x_i + a_0)|$$
Pitfall: The absolute-value function is not differentiable at zero, so setting the partial derivatives $\partial E_1/\partial a_0$ and $\partial E_1/\partial a_1$ to zero yields a pair of equations that we generally cannot solve analytically.
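A sketch of evaluating this criterion (again with an assumed helper name), together with the reason its minimization is awkward, is given below.

```python
import numpy as np

# Absolute deviation error E_1(a0, a1): the sum of absolute residuals.
# Because |.| has no derivative at 0, minimizing this over (a0, a1)
# generally requires iterative or linear-programming techniques rather
# than a closed-form solution.
def e_1(a0, a1, x, y):
    return np.sum(np.abs(np.asarray(y) - (a1 * np.asarray(x) + a0)))
```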
The least squares ($E_2$) criterion, the standard in numerical analysis, squares the residuals:
$$E_2(a_0, a_1) = \sum_{i=1}^{n} [y_i - (a_1 x_i + a_0)]^2$$
This creates a smooth, differentiable surface where calculus can easily find a global minimum.
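Setting $\partial E_2/\partial a_0 = 0$ and $\partial E_2/\partial a_1 = 0$ gives two linear "normal equations," whose solution is the familiar closed form sketched below (the function name is my own; NumPy is assumed only for array arithmetic).

```python
import numpy as np

# Closed-form least squares line from the normal equations:
#   a1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
#   a0 = (Sxx*Sy - Sxy*Sx) / (n*Sxx - Sx^2)
def least_squares_line(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sx, sy = x.sum(), y.sum()
    sxx, sxy = (x * x).sum(), (x * y).sum()
    denom = n * sxx - sx * sx
    a1 = (n * sxy - sx * sy) / denom
    a0 = (sxx * sy - sxy * sx) / denom
    return a0, a1

# Data lying exactly on y = 2x + 1 are recovered exactly.
print(least_squares_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```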
Analytical Constraints
Choosing a metric is thus a balance of statistical judgment and analytical tractability. The absolute deviation method does not give sufficient weight to a point that is considerably out of line with the approximation, while least squares strikes a middle ground: it places substantially more weight on a badly errant point than $L_1$ does, yet, unlike the minimax criterion, it is not governed entirely by that single rogue point.
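The small experiment below (made-up data and a fixed candidate line, purely for illustration) shows how differently the three measures react when one measurement goes badly wrong: $E_\infty$ is set entirely by the rogue point, $E_1$ grows only linearly with it, and $E_2$ grows quadratically.

```python
import numpy as np

# Hypothetical data near y = 2x + 1, evaluated against that fixed line,
# first clean and then with one badly errant measurement.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_clean = np.array([1.0, 3.1, 4.9, 7.0, 9.1])
y_bad = y_clean.copy()
y_bad[2] = 10.0                      # a single rogue measurement
a0, a1 = 1.0, 2.0                    # fixed candidate line

for label, y in [("clean", y_clean), ("one outlier", y_bad)]:
    r = y - (a1 * x + a0)            # residuals
    print(f"{label:12s} E_inf={np.abs(r).max():5.2f}  "
          f"E_1={np.abs(r).sum():5.2f}  E_2={(r**2).sum():6.2f}")
```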